feature: Support Spark expression: arrays_zip #3643
parthchandra merged 28 commits into apache:main
Conversation
Thanks @comphead
@comphead Thanks for your review. This implementation doesn't use DataFusion for now because I need to pass
```scala
// mimic Spark's ArraysZip behavior: returns NULL if any argument is NULL
val combinedNullCheck = expr.children.map(child => IsNotNull(child)).reduce(And)
val isNotNullExpr = exprToProtoInternal(combinedNullCheck, inputs, binding)
val nullLiteralProto = exprToProto(Literal(null, BooleanType), Seq.empty)
```
The null literal here uses BooleanType, but elsewhere in this file (e.g., CometArrayAppend at line 88) we use the return type of the expression. DataFusion expects all arms of a case/when to have compatible types, so this may cause an error.
```scala
object CometArraysZip extends CometExpressionSerde[ArraysZip] {
  override def getSupportLevel(expr: ArraysZip): SupportLevel = {
    expr.dataType match {
      case _: ArrayType => Compatible()
```
We should probably check the element type here. There have been issues noted in the past; see #1308, for instance.
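The point about checking the element type can be sketched in plain Rust over a toy type enum (an illustration only — `ToyType` and `is_type_supported` are assumptions, not Comet's DataType handling): an array is supported only if its element type is, recursively, so a nested array of an unsupported type is also rejected.

```rust
// Toy model (not Comet code) of a recursive element-type support check.
#[derive(Debug)]
enum ToyType {
    Int,
    Str,
    Binary, // pretend this one is unsupported
    Array(Box<ToyType>),
}

fn is_type_supported(dt: &ToyType) -> bool {
    match dt {
        ToyType::Int | ToyType::Str => true,
        ToyType::Binary => false,
        // an array is supported only if its element type is supported
        ToyType::Array(elem) => is_type_supported(elem),
    }
}
```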
```rust
let fields = self.fields(input_schema)?;
Ok(List(Arc::new(Field::new_list_field(
    DataType::Struct(Fields::from(fields)),
    true,
```
There is a slight mismatch here. Spark has this defined as non-nullable.
```rust
)))
}
ExprStruct::ArraysZip(expr) => {
    assert!(!expr.values.is_empty());
```
Better to return an Err instead of asserting (an assert will cause a panic):

```rust
return Err(GeneralError("arrays_zip requires at least one argument".to_string()));
```

If you want to be extra safe, you can also check that `expr.values.len() == expr.names.len()`.
Makes sense. I fixed the first one. Thanks Parth.
The second check on length makes sense. Spark's ArraysZip does the same check at https://github.com/apache/spark/blob/branch-4.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L313-L315, so I think we are safe here because of Spark.
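The defensive shape discussed here — validating the inputs up front and returning an error instead of panicking — looks roughly like this plain-Rust sketch (`validate_arrays_zip` and its signature are hypothetical, not the Comet function):

```rust
// Hypothetical sketch of up-front validation instead of assert! (names assumed).
fn validate_arrays_zip(values: &[String], names: &[String]) -> Result<(), String> {
    if values.is_empty() {
        return Err("arrays_zip requires at least one argument".to_string());
    }
    if values.len() != names.len() {
        // mirrors the check Spark's ArraysZip performs on children vs. names
        return Err(format!(
            "arrays_zip expects {} field names but got {}",
            values.len(),
            names.len()
        ));
    }
    Ok(())
}
```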
```scala
val inputTypes = expr.children.map(_.dataType).toSet
for (dt <- inputTypes) {
  if (!isTypeSupported(dt)) {
    Unsupported(Some(s"Unsupported child data type: $dt"))
```
```diff
-    Unsupported(Some(s"Unsupported child data type: $dt"))
+    return Unsupported(Some(s"Unsupported child data type: $dt"))
```
Otherwise you're always falling through to Compatible().
Good catch, sorry I missed this one.
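The bug being fixed — building an `Unsupported` value inside the loop but never returning it, so the function always falls through to "compatible" — corresponds to this plain-Rust sketch of the corrected control flow (`get_support_level` over string type names is a hypothetical stand-in, not the Comet API):

```rust
// Sketch (hypothetical names) of the corrected control flow: report the first
// unsupported child type instead of silently falling through to Compatible.
#[derive(Debug, PartialEq)]
enum SupportLevel {
    Compatible,
    Unsupported(String),
}

fn get_support_level(input_types: &[&str], supported: &[&str]) -> SupportLevel {
    for dt in input_types {
        if !supported.contains(dt) {
            // early return: without it, this value would be discarded
            return SupportLevel::Unsupported(format!("Unsupported child data type: {dt}"));
        }
    }
    SupportLevel::Compatible
}
```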
parthchandra left a comment:

LGTM. Some minor nits.
Nice work @hsiang-c
Depends on #4024
Which issue does this PR close?
Closes #3151 and #3575
Rationale for this change
Support the arrays_zip SQL function.

What changes are included in this PR?

- Ported the arrays_zip_inner implementation from DataFusion 53.0.0 to support custom field names in the resulting struct. This also includes the subsequent patches: "feat: correct struct column names for arrays_zip return type" (datafusion#20886) and "fix: arrays_zip/list_zip allow single array argument" (datafusion#21047).
- At the QueryPlanSerde stage, if one of the arrays_zip arguments is NULL, we return NULL as Spark does. Examples of Spark's arrays_zip:

How are these changes tested?
By SQL File Tests, we covered cases such as a single array argument, nested arrays, arrays of supported types, null arguments, and custom field names in the resulting struct. Here is an example of a custom field name:
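The behaviors those tests cover — zipping to the length of the longest array, null-padding shorter ones, and returning NULL when any argument is NULL — can be modeled in plain Rust. This is an illustrative sketch of the semantics only, not the Comet kernel; `arrays_zip_model` is a hypothetical name, and `Option` stands in for SQL NULL at both the argument and element level.

```rust
// Plain-Rust model (not Comet code) of Spark's arrays_zip semantics.
fn arrays_zip_model(args: &[Option<Vec<i32>>]) -> Option<Vec<Vec<Option<i32>>>> {
    if args.iter().any(|a| a.is_none()) {
        return None; // any NULL argument makes the whole result NULL
    }
    let arrays: Vec<&Vec<i32>> = args.iter().map(|a| a.as_ref().unwrap()).collect();
    let max_len = arrays.iter().map(|a| a.len()).max().unwrap_or(0);
    // zip to the longest array, padding shorter ones with null elements
    Some(
        (0..max_len)
            .map(|i| arrays.iter().map(|a| a.get(i).copied()).collect())
            .collect(),
    )
}
```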